Analyzing product reviews is essential, as it provides insight into how people feel, think, and react to a product. In this project, we analyze reviews of phones from two brands, Apple and Samsung, each with several models. The main purpose of this project is to analyze and compare how consumers on Amazon's website interact with, and describe, each model of these two phone brands.
We will study this interaction by analyzing how the review texts for the two brands differ over time, identifying similarities and differences in the vocabulary used to describe the phones from all perspectives, such as functionality, memory, and condition. To obtain a better overview and more relevant results with respect to current technology and the development of mobile phones, we focus only on the three newest generations of each brand: the Apple and Samsung phones introduced in 2019, 2020, and 2021. The following are the models this project focuses on:
We start by describing the data scraped from Amazon and the procedures used to treat, clean, and prepare it, transforming it into a corpus accompanied by some graphical analysis. We then perform a sentiment analysis to understand how consumers feel about these brands in general and each model in particular. Next, we use unsupervised learning techniques, such as Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA), to identify clusters and similarities in the reviews and to analyze topic attribution per model. Finally, we use supervised learning techniques, such as Random Forest, for tasks such as predicting sentiment based on the brand. All of the above is used to draw conclusions about consumers' reactions to each brand and its models.
This section explains the steps we took to retrieve the data from the US Amazon marketplace. In addition, we explain some of the tasks we applied to the review texts to obtain the final tables. Due to the heavy computation involved in this section, we present only the final outcomes and before/after examples of the wrangling and cleaning; the code can be found in the scrapping.RMD file.
For this section, we retrieved the reviews of the eleventh, twelfth, and thirteenth generations of Apple phones, and of the S20, S21, and S22 generations for Samsung. The phones in each generation, with their number of reviews, are described as follows:
From the numbers above, obtained from US Amazon, we can deduce that Apple phones attract considerably more reviews on Amazon than Samsung phones, suggesting either more buyers or a greater willingness to review. Furthermore, Apple's Generation 11 had the largest number of reviews, with a total of 10,276.
Below is an example of how the data looks:
| | Brand | Model | Reviews |
|---|---|---|---|
| 12983 | Samsung | Samsung Galaxy S20 FE | Good phone with terrible battery |
| 12984 | Samsung | Samsung Galaxy S20 FE | Phone overheat alot |
| 12985 | Samsung | Samsung Galaxy S20 FE | No issue with the phone at all, gave this as a gift to my boyfriend for Christmas. Works amazing! Thank you so much! |
| 12986 | Samsung | Samsung Galaxy S20 FE | It was not full unlocked as described. |
| 12987 | Samsung | Samsung Galaxy S20 FE | Excelente terminal, muy buena relación, calidad precio. Recomendado |
| 12988 | Samsung | Samsung Galaxy S20 FE | Mi hijo estaba muy contento. Súper recomendable |
| 12989 | Samsung | Samsung Galaxy S20 FE | Me gusto mucho el teléfono, pero lo compre supuestamente desbloqueado para todas las compañías y no fue así, no lo puedo usar con Verizon y sprint y nunca dijeron nada a respecto eso fue lo que me decepcionó. |
| 12990 | Samsung | Samsung Galaxy S20 FE | The product worked as described by seller |
| 12991 | Samsung | Samsung Galaxy S20 FE | rendimiento espectacular y fotos increíbles!! |
| 12992 | Samsung | Samsung Galaxy S20 FE | Good |
After scraping the reviews, we noticed that some were written in languages other than English, such as Spanish, Japanese, and Hindi. For that reason, we decided to use Transformers from the Hugging Face 🤗 website through the pipeline() function, in order to apply tasks such as text classification and text translation. First, we applied the text-classification task with the model eleldar/language-detection, a fine-tuned version of xlm-roberta-base on the Language Identification dataset, which allowed us to detect the language of each review in our dataset. After applying the task and the model to the dataset, we created a column called Language recording the detected language, and counted the number of reviews in each.
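Once the Language column exists, tallying the reviews per language is a one-liner. A minimal base-R sketch on made-up data (the real counts come from the full scraped dataset):

```r
# Toy stand-in for the scraped data; the real table has one row per review
# with the detected language code in the Language column
reviews <- data.frame(
  Brand    = c("Apple", "Apple", "Samsung", "Samsung", "Apple"),
  Language = c("en", "es", "en", "ja", "es")
)

# Number of reviews per detected language, most frequent first
lang_counts <- sort(table(reviews$Language), decreasing = TRUE)
lang_counts
```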
| | Brand | Model | Reviews | Language |
|---|---|---|---|---|
| 1239 | Apple | iPhone 11 Pro Max | This is the second phone from Amazon I’ve tried to get onto my T-Mobile/Sprint account. Amazon and Apple swear it’s fully unlocked but multiple people at T-Mobile say I can’t use this phone because it’s locked to Verizon. Just buy from your carrier lol | en |
| 1240 | Apple | iPhone 11 Pro Max | Amazing! Yes It’s Legit &’ Works As Advertised 10/10 | en |
| 1241 | Apple | iPhone 11 Pro Max | El celular tenias muchos rayones notorios en la pantalla, se q es usado pero no me pareció confiable después de ver esas imperfecciones tan notorias. | es |
| 1242 | Apple | iPhone 11 Pro Max | El producto no es 100% desbloqueado | es |
| 1243 | Apple | iPhone 11 Pro Max | This type is good | en |
| 1244 | Apple | iPhone 11 Pro Max | But the celular that I bought is not original, they sent me another charger that is not the celular one. others besides they made me pay dearly and without guarantee … I did not understand that. Thank you | en |
| | Brand | Model | Reviews | Language |
|---|---|---|---|---|
| 51 | Samsung | Samsung Galaxy S20 FE | Love love love this phone. Absolutely no problems with it. | en |
| 52 | Samsung | Samsung Galaxy S20 FE | A date mon garçon ne l a pas tuer…et c’est un gros gamer… | fr |
| 53 | Samsung | Samsung Galaxy S20 FE | i think i made a very wise decision when I bought this phone. with the features and specifications that it has, no wonder some say that it was the best phone in its range. if you are looking for a high spec but has a limited budget, this phone is the best choice! | en |
| 54 | Samsung | Samsung Galaxy S20 FE | Battery is worse than battery test :)) | en |
| 55 | Samsung | Samsung Galaxy S20 FE | This arrived in just a few days via FEDEX with no problems at all. Be aware that the charger included is the massive European plug from Samsung. You will need an adapter. As far as the phone, it is a great phone with a great camera. Well worth the budget-friendly price compared to the more expensive options available. | en |
| 56 | Samsung | Samsung Galaxy S20 FE | 概ね満足です。 不備などはありませんでした。 | ja |
| 57 | Samsung | Samsung Galaxy S20 FE | マイネオauシムで使用マイネオも5gが12月開始ですがとりあえず4gで使用 | ja |
For this part, we used the text-translation task with the model Helsinki-NLP/opus-mt-es-en, which helped us translate the reviews written in Spanish into English. We translated only Spanish because it was the second most common language in our data; languages such as French, Japanese, and Hindi each had fewer than 10 observations.
For visualization purposes, we combined the input and output tables to show how the translation was executed. For the analysis itself, we kept only the English text (original or translated).
| | Brand | Model | Reviews | Language |
|---|---|---|---|---|
| a.10 | Apple | iPhone 11 Pro Max | El teléfono estaba desbloqueado, batería 90%. Muy bueno. | es |
| b.10 | Apple | iPhone 11 Pro Max | The phone was unlocked, battery 90%. | en |
| a.11 | Apple | iPhone 11 Pro Max | El celular funciona regularmente bien, el sensor de proximidad no estaba funcionando bien pero no era mayor problema…. La batería estaba consumida pero definitivamente creo es la mejor batería que hecho iPhone dura todoooo el día, no recomiendo celulares usados por ese precio deberían costar menos ya que siempre viene con problemas los cuales result caro solucionarlos al menos en Ecuador | es |
| b.11 | Apple | iPhone 11 Pro Max | The cell phone works regularly well, the proximity sensor wasn’t working well but it wasn’t a major problem…. The battery was consumed but I definitely think it’s the best battery that made iPhone lasts alloooo the day, I don’t recommend used phones for that price should cost less as it always comes with problems which result expensive to fix them at least in Ecuador | en |
| a.12 | Apple | iPhone 11 Pro Max | Todo el telefono esta exelente desbloqueado y trabaja bien | es |
| b.12 | Apple | iPhone 11 Pro Max | The whole phone is exelent unlocked and works well. | en |
| | Brand | Model | Reviews | Language |
|---|---|---|---|---|
| a.1 | Samsung | Samsung Galaxy S20 FE | Buen producto me gusto mucho | es |
| b.1 | Samsung | Samsung Galaxy S20 FE | Good product I liked very much | en |
| a.2 | Samsung | Samsung Galaxy S20 FE | Buen teléfono, aunque no es la versión 5g, y los marcos son más grandes de lo que esperaba | es |
| b.2 | Samsung | Samsung Galaxy S20 FE | Good phone, though it’s not version 5g, and the frames are bigger than I expected. | en |
| a.3 | Samsung | Samsung Galaxy S20 FE | Excelente terminal, muy buena relación, calidad precio. Recomendado | es |
| b.3 | Samsung | Samsung Galaxy S20 FE | Excellent terminal, very good ratio, quality price. Recommended | en |
## {-}
After the steps explained above, our files went from containing reviews in several languages to final files with reviews in English only. As a result, the number of reviews decreased slightly: for Apple from 12,965 to 12,712 observations, and for Samsung from 2,606 to 2,585.
Now that all our text is in English, we can start processing the dataset to transform it into a corpus. The second step is to create tokens, one per word. To get rid of non-conforming formats, we remove all punctuation, symbols, numbers, and separators from the corpus. To improve the analysis, we also remove "stop words", i.e. filler words such as "a" and "the" that add no value to the analysis. Finally, instead of a stemming method, we use the lemmatization technique: a lexicon dictionary looks up the root of each word, so that minor variants such as teach-teaching-taught are all reduced to the root teach, removing unimportant repetition. Before continuing with the graphical representation, we also compute the following information:
Below are two tables representing the corpus of smartphone reviews that we use. The first shows the corpus summary (the Text column contains the model name with the number of each review across the dataset, so that we can identify them), while the second is grouped by model.
| Text | Types | Tokens | Sentences | Brand | Model |
|---|---|---|---|---|---|
| iPhone 11 Pro Max_1 | 26 | 28 | 1 | Apple | iPhone 11 Pro Max |
| iPhone 11 Pro Max_2 | 21 | 28 | 3 | Apple | iPhone 11 Pro Max |
| iPhone 11 Pro Max_3 | 4 | 4 | 1 | Apple | iPhone 11 Pro Max |
| iPhone 11 Pro Max_4 | 22 | 23 | 1 | Apple | iPhone 11 Pro Max |
| iPhone 11 Pro Max_5 | 253 | 636 | 46 | Apple | iPhone 11 Pro Max |
| iPhone 11 Pro Max_6 | 35 | 44 | 2 | Apple | iPhone 11 Pro Max |
| iPhone 11 Pro Max_7 | 18 | 18 | 1 | Apple | iPhone 11 Pro Max |
| iPhone 11 Pro Max_8 | 41 | 50 | 4 | Apple | iPhone 11 Pro Max |
| iPhone 11 Pro Max_9 | 39 | 56 | 3 | Apple | iPhone 11 Pro Max |
| iPhone 11 Pro Max_10 | 15 | 16 | 2 | Apple | iPhone 11 Pro Max |
| Text | Types | Tokens | Sentences | Brand | Model |
|---|---|---|---|---|---|
| iPhone 11 | 7514 | 181200 | 9819 | Apple | iPhone 11 |
| iPhone 11 Pro | 6081 | 107050 | 5259 | Apple | iPhone 11 Pro |
| iPhone 11 Pro Max | 5364 | 85350 | 4251 | Apple | iPhone 11 Pro Max |
| iPhone 12 | 3490 | 36055 | 1868 | Apple | iPhone 12 |
| iPhone 12 mini | 3257 | 31869 | 1855 | Apple | iPhone 12 mini |
| iPhone 12 Pro | 1999 | 12841 | 717 | Apple | iPhone 12 Pro |
| iPhone 12 Pro Max | 990 | 3806 | 193 | Apple | iPhone 12 Pro Max |
| iPhone 13 | 1518 | 7307 | 397 | Apple | iPhone 13 |
| iPhone 13 mini | 1244 | 6017 | 354 | Apple | iPhone 13 mini |
| iPhone 13 Pro | 656 | 2096 | 121 | Apple | iPhone 13 Pro |
| iPhone 13 Pro Max | 706 | 2244 | 107 | Apple | iPhone 13 Pro Max |
| Samsung Galaxy 21 Ultra | 481 | 1233 | 71 | Samsung | Samsung Galaxy 21 Ultra |
| Samsung Galaxy 22 Ultra | 1846 | 8822 | 504 | Samsung | Samsung Galaxy 22 Ultra |
| Samsung Galaxy Note 20 Ultra | 673 | 1783 | 100 | Samsung | Samsung Galaxy Note 20 Ultra |
| Samsung Galaxy S20 FE | 587 | 1613 | 98 | Samsung | Samsung Galaxy S20 FE |
| Samsung Galaxy S20 Plus | 1264 | 5095 | 295 | Samsung | Samsung Galaxy S20 Plus |
| Samsung Galaxy S21 FE | 7378 | 111961 | 6460 | Samsung | Samsung Galaxy S21 FE |
| Samsung Galaxy S21 Plus | 2806 | 18504 | 1089 | Samsung | Samsung Galaxy S21 Plus |
| Samsung Galaxy S22 | 947 | 3355 | 186 | Samsung | Samsung Galaxy S22 |
| Samsung Galaxy S22 Plus | 1017 | 3818 | 210 | Samsung | Samsung Galaxy S22 Plus |
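The cleaning steps described earlier can be sketched in a few lines of base R. This toy function illustrates lowercasing, stripping punctuation and numbers, and removing stop words; lemmatization, which needs a lexicon, is omitted here, and the real analysis uses a corpus toolkit rather than this hand-rolled version:

```r
# Minimal base-R sketch of the cleaning pipeline: lowercase, strip
# punctuation/symbols/numbers, drop stop words (tiny toy stop-word list)
clean_tokens <- function(text, stopwords = c("a", "the", "is", "and")) {
  text <- tolower(text)
  text <- gsub("[[:punct:][:digit:]]+", " ", text)    # punctuation and numbers
  tokens <- unlist(strsplit(trimws(text), "[[:space:]]+"))
  tokens[!tokens %in% stopwords]                      # remove stop words
}

clean_tokens("The battery lasts 2 days, and the screen is great!")
#> [1] "battery" "lasts"   "days"    "screen"  "great"
```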
Observing the frequency plot, we see that the most common word is "phone", followed by "battery" and "screen". While "phone" comes as no surprise, since our data consists of phone reviews from Amazon, the next two words are more interesting: they suggest that the hottest topics for consumers are a new phone's battery life and screen quality, rather than its software or added features.
This graph focuses on the frequency of the five most common words in each document. Interestingly, although more than 80% of the reviews in our dataset come from Apple buyers, most of the words in the sample shown relate to Android rather than iOS. Another interesting fact is that the most common words throughout the corpus are samsung, s9, s20, phone, oneplus, entry, device, camera, and apple.
We observed that the models placed in the top part of the chart by the DFM are the ones with the highest number of tokens. The common ground across the models is that customers talked about the phone and, in some cases, about the seller. For example, for the iPhone 11 some reviews concerned the condition of the phones, as they were refurbished, with users praising the seller or product or complaining about what they had received. Among the top terms were screen, scratch, phone, iphone, buy, and battery, which concurs with this understanding. Furthermore, the term phone is used across all models, while scratch appears mostly in iPhone 11 reviews.
The bottom part of the chart shows the same models as the top, but here the tokens are weighted by their relevance to the review, i.e. the terms shown capture the main context of the reviews. Thus the iPhone 11 reviews are characterized by scratch as their main theme, whereas for the Samsung Galaxy S21 FE the theme is comparison with the previous model (the Samsung Galaxy S20 FE). We also detected that the iPhone 11 Pro reviews follow the same pattern as the iPhone 11, the main difference being the terms aesthetic and generic.
Due to the large number of reviews, we created a visualization of the maximum TF-IDF per document instead of showing each document individually. This representation tells us that each of these words has a large TF-IDF in at least one document. Analyzing the output, "entry" is the most relevant word, followed by "oneplus", "s9", and "s20". While the words ranked 2 to 4 are all model-specific, it turns out that "entry" matters most to consumers; in a purchase context, "entry" likely refers to the entry-level price.
We applied the same procedure as in the chart above, but grouped the documents by model so as to identify the terms with the largest TF-IDF for each smartphone. From the chart, we can interpret the following:
While this log-frequency representation is quite cluttered due to the number of documents, we can still distinguish the tokens found in the frequency plot. The words phone, battery, and screen are clearly the most common and appear in nearly every document. By contrast, "entry" has a lower document frequency, meaning it appears in fewer documents, while still maintaining a decent log frequency, implying that it is specific to certain documents only.
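The TF-IDF weighting used above can be reproduced by hand on a toy document-term matrix. A minimal base-R sketch, where all counts are made up for illustration:

```r
# Hand-rolled TF-IDF on a tiny toy document-term matrix (counts invented)
dtm <- rbind(
  doc1 = c(phone = 3, battery = 1, entry = 0),
  doc2 = c(phone = 2, battery = 2, entry = 0),
  doc3 = c(phone = 1, battery = 0, entry = 4)
)
tf  <- dtm / rowSums(dtm)                  # term frequency per document
idf <- log(nrow(dtm) / colSums(dtm > 0))   # inverse document frequency
tfidf <- sweep(tf, 2, idf, `*`)

# "phone" appears in every document, so its IDF (and TF-IDF) is zero;
# "entry" is concentrated in doc3, so it gets the largest weight there
round(tfidf, 3)
```

This is why document-specific words such as "entry" dominate the max-TF-IDF plot even though "phone" is far more frequent overall.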
Another representation of the document-feature matrix is a word cloud, where the most relevant words are drawn largest. From this cloud, we can identify phone, battery, screen, iphone, scratch, and condition as the most used words.
We now want to glance at the diversity of the words used. This is an interesting approach, as it shows whether the reviews are rich in vocabulary or repetitive. Due to the number of documents, we only show a sample. A limitation of this representation is that lexical diversity depends on the length of the text, in this case the reviews. Nonetheless, it provides insight into how diverse the reviewers' vocabulary is: the higher the TTR (type-token ratio), the more diverse the lexicon.
For instance, the chart shows a higher lexical diversity for the Samsung Galaxy 21 Ultra than for the other models. Looking more broadly, the models with the highest lexical diversity are Samsung's, joined only by the Pro versions of Apple's thirteenth generation. The iPhone 11, the model with the highest number of tokens, has the lowest lexical diversity.
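The TTR itself is simply the number of distinct word types divided by the total number of tokens. A minimal sketch on hypothetical token vectors:

```r
# Type-token ratio (TTR): distinct types / total tokens
ttr <- function(tokens) length(unique(tokens)) / length(tokens)

# Hypothetical token vectors for two models: one repetitive, one diverse
reviews_tokens <- list(
  model_a = c("great", "phone", "great", "battery"),
  model_b = c("excellent", "camera", "vivid", "display")
)
sapply(reviews_tokens, ttr)
```

Note that, as mentioned above, longer texts mechanically tend toward lower TTR, so the ratio is best compared across texts of similar length.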
To continue our analysis, we applied a Chi-square test of independence between Apple's and Samsung's reviews. The purpose of this test is to compare the terms of one set of documents against another; here, we contrast the terms used by one consumer base with those used by the other. We created a plot of the keyness results and, for reference, added two tables with the 10 largest Chi-square values, once with Samsung and once with Apple as the target.
From the graph, we can see that in both sets of reviews the most distinctive word is the respective brand name. The main difference lies in the remaining exclusive words. For Samsung (with Apple as reference), most distinctive terms relate to other models. When Apple is the target (with Samsung as reference), the most distinctive terms are condition and scratches.
Our intuition behind this pattern is that, in Samsung's case, it may stem from the number of products the brand offers, giving reviewers a larger set of comparisons when reviewing:
as for Apple:
| feature | chi2 | p | n_target | n_reference |
|---|---|---|---|---|
| samsung | 2326 | 0 | 886 | 47 |
| s20 | 684 | 0 | 241 | 1 |
| 5g | 667 | 0 | 283 | 33 |
| galaxy | 624 | 0 | 254 | 23 |
| fe | 563 | 0 | 197 | 0 |
| fingerprint | 475 | 0 | 240 | 54 |
| s21 | 446 | 0 | 169 | 8 |
| s9 | 318 | 0 | 113 | 1 |
| reader | 316 | 0 | 161 | 37 |
| sd | 265 | 0 | 96 | 2 |
| feature | chi2 | p | n_target | n_reference |
|---|---|---|---|---|
| iphone | 694 | 0 | 2478 | 84 |
| condition | 578 | 0 | 2032 | 64 |
| scratches | 571 | 0 | 1747 | 20 |
| apple | 241 | 0 | 1026 | 58 |
| renewed | 232 | 0 | 724 | 10 |
| battery | 229 | 0 | 4000 | 789 |
| perfect | 210 | 0 | 1180 | 107 |
| arrived | 197 | 0 | 995 | 77 |
| seller | 193 | 0 | 810 | 44 |
| product | 191 | 0 | 1278 | 141 |
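Each keyness score in the tables above comes from a 2x2 contingency table for one term: its counts in the target and reference corpora against all other tokens. A hedged base-R sketch; the corpus token totals used below are hypothetical, not our real ones:

```r
# Keyness for a single term via a 2x2 chi-square test:
# rows = term vs. all other tokens, cols = target vs. reference corpus
keyness_chi2 <- function(n_target, n_reference, total_target, total_reference) {
  tab <- matrix(c(n_target,                n_reference,
                  total_target - n_target, total_reference - n_reference),
                nrow = 2, byrow = TRUE)
  chisq.test(tab, correct = FALSE)$statistic
}

# A term frequent in the target corpus but rare in the reference one
# (886 and 47 are taken from the table above; the totals are invented)
keyness_chi2(n_target = 886, n_reference = 47,
             total_target = 150000, total_reference = 500000)
```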
To get an idea of the relationships between words, such as the most common combinations, we present a visual representation of those links. The graph reconfirms that all the reviews relate to the term phone. Note, however, that on the outer reaches of the plot, terms such as perfect, excellent, recommend, and happy mostly come from one-token reviews, i.e. reviews consisting of a single word of feedback. This would explain why these terms show no links to other tokens in the graph.
This section aims at attributing a sentiment score to each review. We decided to work with the nrc and afinn dictionaries from the tidytext package, as well as with the sentimentr package. This approach offers more granular sentiment scoring than the tidyverse and quanteda packages, which only provide a positive/negative classification.
We used all three datasets for this analysis, namely smartphone_reviews_final.csv, Apple_final.csv, and Samsung_final.csv. To prepare the data, we tokenized the reviews, lowercasing them and removing punctuation and numbers. Below is a glimpse of the tokenized version of smartphone_reviews_final.csv, which includes all the reviews (Apple and Samsung).
# Read data
library(dplyr)     # mutate(), relocate(), %>%
library(tidytext)  # unnest_tokens()

all_reviews <- read.csv(here::here("data/smartphone_reviews_final.csv"))
apple <- read.csv(here::here("data/Apple_final.csv"))
samsung <- read.csv(here::here("data/Samsung_final.csv"))
# Unnest tokens: one row per word, lowercased, punctuation and numbers stripped
all_reviews_token <- all_reviews %>%
  mutate(review_id = seq_len(nrow(all_reviews))) %>%
  relocate(review_id, .before = "Brand") %>%
  unnest_tokens(output = "word",
                input = "Reviews",
                to_lower = TRUE,
                strip_punct = TRUE,
                strip_numeric = TRUE)
apple_token <- apple %>%
  mutate(review_id = seq_len(nrow(apple))) %>%
  relocate(review_id, .before = "Brand") %>%
  unnest_tokens(output = "word",
                input = "Reviews",
                to_lower = TRUE,
                strip_punct = TRUE,
                strip_numeric = TRUE)
samsung_token <- samsung %>%
  mutate(review_id = seq_len(nrow(samsung))) %>%
  relocate(review_id, .before = "Brand") %>%
  unnest_tokens(output = "word",
                input = "Reviews",
                to_lower = TRUE,
                strip_punct = TRUE,
                strip_numeric = TRUE)
#> 'data.frame': 553049 obs. of 4 variables:
#> $ review_id: int 1 1 1 1 1 1 1 1 1 1 ...
#> $ Brand : chr "Apple" "Apple" "Apple" "Apple" ...
#> $ Model : chr "iPhone 11 Pro Max" "iPhone 11 Pro Max" "iPhone"..
#> $ word : chr "i" "received" "the" "iphone" ...
Using the tokenized version of the reviews, we first proceeded to a sentiment analysis with the nrc dictionary from the tidytext package to compare overall sentiment scores between Apple and Samsung reviews. We created our own function, sentiment_function, to return the data in the desired format. See the function code and its application below.
# This function runs a sentiment analysis using the 'nrc' dictionary
sentiment_function <- function(token_data, id_column, sentiment_column){
sentiment_data <- inner_join(token_data, get_sentiments("nrc"),
by = c("word" = "word"))
sentiment_matrix <- table(sentiment_data[[id_column]],
sentiment_data[[sentiment_column]])
return(sentiment_matrix)
}
# Get sentiment analysis: Apple vs. Samsung
# Sentiment classification using the 'nrc' dictionary
for (i in list(apple_token, samsung_token)){
# Perform the sentiment analysis
sentiment_analysis <- sentiment_function(i, "review_id", "sentiment")
# Print the plot
sentiment_plot <- ggplot(tibble(sentiment = names(colSums(sentiment_analysis)),
sum_value = colSums(sentiment_analysis)),
aes(x = reorder(sentiment, -sum_value), y = sum_value)) +
geom_col() +
ggtitle(i$Brand[1]) +
ylab("Number of tokens") +
xlab("Sentiment") +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=0.5))
# Assign graph
assign(paste0("sentiment_", i$Brand[1]), sentiment_plot)
}
This graph gives some indication that Apple reviews contain more sentiments linked to trust and joy than Samsung's. Apple phones also seem to trigger relatively less sadness, fear, and disgust than their counterpart.
Secondly, we proceeded to a sentiment analysis using the afinn dictionary and the sentimentr package. Again, we show the function we created to return the results in the desired format, as well as an example of its implementation.
# This function runs two sentiment analyses: one using the 'afinn' dictionary
# and one using the sentimentr package. If the inputs for either analysis are
# missing, it runs only the one it has inputs for, or returns NULL.
value_function <- function(data_text=NULL, data_token=NULL, id_column=NULL, value_column=NULL){
data_value_afinn <- NULL
data_value_sentimentr <- NULL
if (!is.null(data_token) | !is.null(data_text)){
if (!is.null(data_token)){
# Get "afinn" sentiment value
data_value_afinn <- inner_join(data_token, get_sentiments("afinn"),
by = c("word" = "word")) %>%
group_by(review_id = {{id_column}}) %>%
summarise(afinn_value = mean({{value_column}})) %>%
melt(id.vars = "review_id")
}
if (!is.null(data_text)){
# Get "sentimentr" sentiment value
data_value_sentimentr <- get_sentences(data_text) %>%
sentiment() %>%
group_by(review_id = element_id) %>%
summarise(sentimentr_value = mean(sentiment)) %>%
melt(id.vars = "review_id")
}
if (!is.null(data_value_afinn) & !is.null(data_value_sentimentr)){
# Join the data into one table
data_value <- rbind(data_value_afinn, data_value_sentimentr)
return(data_value)
} else if (!is.null(data_value_afinn)){
return(data_value_afinn)
} else if (!is.null(data_value_sentimentr)){
return(data_value_sentimentr)
} else {
return(NULL)
}
}
}
# Value based using "afinn" dictionary and "sentimentr"
# Valence shifter: https://www.r-bloggers.com/2020/04/sentiment-analysis-in-r-with-sentimentr-that-handles-negation-valence-shifters/
index <- 1
data_list <- list(apple, samsung)
for (i in list(apple_token, samsung_token)){
# Perform the sentiment value analysis
value_analysis <- value_function(data_list[[index]]$Reviews, i, review_id, value)
# Print the plot
value_plot <- ggplot(value_analysis, aes(y = value, fill = variable)) +
geom_boxplot() +
ylab("Average sentiment value") +
ggtitle(i$Brand[1]) +
scale_y_continuous(breaks = seq(0,5,0.2), limits = c(0,5)) +
guides(fill=guide_legend("Sentiment method")) +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank())
# Assign graph
assign(paste0("value_", i$Brand[1]), value_plot)
# Increment index
index <- index + 1
}
It is interesting to see that, according to the afinn dictionary, both brands have almost exactly the same sentiment distribution. Regarding the sentimentr package, the sentiment scores look slightly higher for Apple, but the difference is negligible.
While there is a lot to say about these graphs, we will list the most interesting outcomes. According to the nrc dictionary, the iPhone 12 mini seems to be the model most associated with negative sentiments. Turning to the value-based analyses (afinn and sentimentr), and to the sentimentr results in particular, the Pro Max models seem to have slightly higher sentiment scores than the Pro and base ones. However, the difference seems to be negligible.
For the unsupervised learning analysis, we analyze the similarities between the models and between the words present in the reviews. Of the three distance measures considered, we use the Euclidean distance as our distance/similarity measure for this project, as it measures the shortest (straight-line) distance between objects.
Reading the Euclidean distances directly from the matrix is quite complex, since there are several models. To highlight similarities more easily, we therefore create a heatmap of the similarities between the reviews.
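As a sketch of the underlying computation, the pairwise Euclidean distances feeding such a heatmap can be obtained with base R's dist() on a document-feature matrix. The term counts below are toy values, for illustration only:

```r
# Euclidean distances between models from a toy document-feature matrix
dfm <- rbind(
  iphone_11  = c(phone = 40, battery = 25, scratch = 30),
  iphone_12  = c(phone = 38, battery = 27, scratch = 28),
  galaxy_s21 = c(phone = 10, battery = 30, scratch = 1)
)
d <- dist(dfm, method = "euclidean")
round(as.matrix(d), 1)   # symmetric distance matrix, zero diagonal
```

Models with similar vocabulary profiles (the two iPhones here) end up close together, which is exactly what the heatmap colors encode.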
Looking at the heatmap of the grouped reviews below, we can conclude the following points:
Further, we studied the graph that encodes the Euclidean distances in the thickness of the lines connecting the models (qgraph): the thicker the line, the more dissimilar, i.e. the further apart, the models. As in the heat map above, the line thicknesses confirm our previous finding that the Samsung Galaxy S21 FE and the iPhone 11 are the models most dissimilar from all the others.
## {-}
Moving on to the clustering of documents, we used both hierarchical and k-means models. We started by creating dendrograms showing which models are similar to each other and at which stage they are clustered together. One of the first clusters formed joins the iPhone 13 Pro and the Samsung Galaxy 21 Ultra. Furthermore, as can be seen, we can choose to keep four (4) clusters. Comparing the dendrogram with our previous results, we can also confirm how dissimilar the iPhone 11 and the Samsung Galaxy S21 FE are from the other models, as they were added to the clustering in the last two iterations.
For the clustering below, we chose the complete-linkage method, as it gives almost the same results as average linkage while showing clearly distinct clusters.
After creating the dendrograms, we analyzed the results of k-means clustering with 4 clusters. The ratio of the between-cluster sum of squares to the total sum of squares is around 90%, making the within-cluster share about 10%. These results look promising, since we want a high between-cluster and a low within-cluster sum of squares.
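The between/total sum-of-squares ratio quoted above can be read directly off a fitted kmeans object. A minimal sketch on simulated 2-D data (not our review data):

```r
# K-means on two well-separated simulated clusters
set.seed(123)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 8), ncol = 2))
km <- kmeans(x, centers = 2, nstart = 10)

# Share of total variation explained by the clustering; higher means
# better-separated clusters (our grouped reviews gave ~90%)
bss_ratio <- km$betweenss / km$totss
round(bss_ratio, 3)
```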
We then created the 4 clusters and extracted the 10 words most often used in each, to get an idea of what each cluster is about. We did this for both methods and, as shown below, the 4 clusters are the same regardless of whether hierarchical clustering or the k-means model was used; only their order changes.
| Clust.1 | Clust.2 | Clust.3 | Clust.4 |
|---|---|---|---|
| iphone | scratch | iphone | fe |
| scratch | iphone | scratch | s20 |
| health | aesthetic | mini | samsung |
| renew | generic | s21 | s8 |
| refurbish | mica | health | ram |
| damage | renew | renew | entry |
| speaker | max | s22 | s21 |
| daughter | buyspry | samsung | snapdragon |
| transfer | health | refurbish | flagship |
| generic | pro | dirty | s9 |
| Clust.1 | Clust.2 | Clust.3 | Clust.4 |
|---|---|---|---|
| iphone | fe | scratch | iphone |
| scratch | s20 | iphone | scratch |
| health | samsung | aesthetic | mini |
| renew | s8 | generic | s21 |
| refurbish | ram | mica | health |
| damage | entry | renew | renew |
| speaker | s21 | max | s22 |
| daughter | snapdragon | buyspry | samsung |
| transfer | flagship | health | refurbish |
| generic | s9 | pro | dirty |
Next, we turn to the similarities between words, again visualized as a heat map. Surprisingly, the heat map below shows no strong similarity between any of the words: the closest pair is "life" and "saver", with a cosine similarity of around 0.55, which makes sense since these two words often appear together.
The heat map below shows the similarities between all of the words used in the reviews. The word "samsung", however, is the most dissimilar from the others. Looking at the details of its similarities, the words most similar to "samsung" are "camera" and "fast". We can therefore say that, among all the features, what people commented on most for the Samsung phones are their speed and camera.
On the other hand, looking at the similarity matrix for the iPhone, we can see that it is similar to almost all of the features mentioned, which means that consumers mentioned those features fairly evenly in their reviews. However, it is least similar to the words “camera”, “fast”, and “card”, meaning these words were mentioned relatively rarely in the iPhone reviews compared to the others.
#> [1] "phone" "battery" "screen" "buy" "scratch"
#> [6] "iphone" "condition" "life" "purchase" "love"
#> [11] "day" "product" "camera" "time" "perfect"
#> [16] "arrive" "return" "excellent" "price" "charger"
#> [21] "charge" "recommend" "apple" "brand" "amazon"
#> [26] "happy" "issue" "fast" "samsung" "seller"
#> [31] "unlock" "box" "month" "receive" "protector"
#> [36] "quality" "renew" "bad" "expect" "sim"
#> [41] "refurbish" "money" "call" "card" "device"
#> [46] "fine" "review" "cell" "original" "worth"
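The word-similarity heat maps described above can be reproduced with `quanteda.textstats`; a sketch, again assuming a dfm named `dfm_models` restricted to the feature words listed above:

```r
library(quanteda)
library(quanteda.textstats)

# Cosine similarity between features (words) of the assumed dfm `dfm_models`
sim     <- textstat_simil(dfm_models, margin = "features", method = "cosine")
sim_mat <- as.matrix(sim)

# Base-R heat map of the word-by-word similarity matrix
heatmap(sim_mat, symm = TRUE, main = "Cosine similarity between words")
```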
The dendrogram below shows how each word is clustered with the others. We can clearly see four (4) clusters. As seen previously in the heat map, the word “samsung” is clustered by itself, as it is the most dissimilar word from the others.
Co-occurrence describes how words occur together, which in turn captures the different relationships between words. From there we can see that the most frequently co-occurring words are “phone” and “battery”, which is also clearly visible in the dendrogram.
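A co-occurrence matrix of this kind can be built with quanteda's `fcm`; a sketch assuming a tokens object `toks` of the cleaned reviews (the object name is hypothetical):

```r
library(quanteda)

# Count how often two words appear in the same review (document context);
# tri = FALSE keeps the matrix symmetric so either index order works
co <- fcm(toks, context = "document", count = "frequency", tri = FALSE)

# Co-occurrence count for a specific pair, e.g. "phone" and "battery"
co["phone", "battery"]
```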
We use topic modeling to discover the abstract “topics” that occur in a collection of documents, in this case the corpus of reviews grouped by smartphone model (Wikipedia, 2022). Topic modeling helps us identify the context of the documents by detecting similar word patterns inside them and by clustering those groups of words together.
Latent Semantic Analysis, or LSA, is one of the techniques that we will use for topic modeling. LSA is a dimensionality-reduction technique that decomposes the DTM into 3 matrices (\(M = U \Sigma V^{t}\)), where \(\Sigma\) represents the strength of each topic, \(U\) the links between the documents and each topic, and \(V^{t}\) the links between the terms and each topic.
First, we plotted the first dimension to check whether it is associated with document length, as this is known to happen with the first LSA dimension. Looking at the result in the tab called Dimension 1, we can confirm that this is the case: Dimension 1 is negatively correlated with document length. The iPhone 11, which has the largest number of tokens, sits in the bottom-right of the chart, while the Samsung Galaxy S21 Ultra, with the smallest number, sits in the top-left.
On the second tab, “Topics 2 and 3”, we interpret the top 5 words associated with topics 2 and 3, and the top 5 negatively associated with them. In the table for Topic 2, the words “scratch”, “iphone”, “condition”, “battery” and “product” are associated with the topic, while “5g”, “s20”, “camera”, “phone” and “samsung” are negatively linked; we can therefore identify Topic 2 with models from the brand Apple. In the table for Topic 3, the main associated words are “pro”, “samsung”, “arrive”, “battery” and “camera”, and the negatively associated ones are “brand”, “buy”, “iphone”, “unlock”, and “phone”. This topic is likely found in reviews of Samsung models and some of Apple's Pro models.
For a visual representation of the words discussed for topics 2 and 3, we plot the corresponding dimensions (Dim 2 and Dim 3) in a biplot. The tab “Biplot of Dim 2 and 3” confirms the points above, as we see the same words associated (and negatively associated) with Dim 2 (Topic 2) and Dim 3 (Topic 3). The Samsung Galaxy S21 FE, iPhone 11 Pro Max and iPhone 11 Pro are associated with Topic 3, while the iPhone 11 is unconnected to it. For Topic 2, the iPhone 11 Pro, iPhone 11 Pro Max, iPhone 11, and iPhone 12 are related to it.
| Term | Topic 2 loading |
|---|---|
| scratch | 0.337 |
| iphone | 0.282 |
| condition | 0.269 |
| battery | 0.210 |
| product | 0.142 |
| 5g | -0.120 |
| s20 | -0.124 |
| camera | -0.151 |
| phone | -0.349 |
| samsung | -0.373 |
| Term | Topic 3 loading |
|---|---|
| pro | 0.283 |
| samsung | 0.270 |
| arrive | 0.248 |
| battery | 0.217 |
| camera | 0.192 |
| brand | -0.110 |
| buy | -0.114 |
| iphone | -0.123 |
| unlock | -0.172 |
| phone | -0.232 |
Now we apply the same approach as above, but using the TF-IDF matrix. As already explained, TF-IDF quantifies the relevance of a word in a document. First, we build the LSA object with the textmodel_lsa function from the quanteda.textmodels package on the TF-IDF matrix of the grouped model reviews (keeping only 5 dimensions). Next, we break the data down for interpretability by considering only the 5 words with the highest values and the 5 with the lowest. Finally, we plot the biplot to identify the topics and their associated and unrelated words.
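These steps can be sketched as follows, assuming a dfm of the grouped model reviews named `dfm_models` (the object name is hypothetical):

```r
library(quanteda)
library(quanteda.textmodels)

# TF-IDF weighting of the grouped dfm, then LSA with 5 dimensions
tfidf <- dfm_tfidf(dfm_models)
lsa   <- textmodel_lsa(tfidf, nd = 5)

# Term loadings for Topic 2 (second column of the feature matrix):
# the 5 highest and 5 lowest values give the associated/unrelated words
load2 <- sort(lsa$features[, 2], decreasing = TRUE)
head(load2, 5)  # most associated terms
tail(load2, 5)  # most negatively associated terms
```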
We discover that Topic 2 (Dim 2) is associated with the words “scratch”, “iphone”, “generic”, “health”, “renew” and the models iPhone 11 Pro, iPhone 11, and iPhone 11 Pro Max, and is thereby unrelated to terms such as “s21”, “s8”, “samsung”, “s20”, “fe” and the model Samsung Galaxy S21 FE. For Topic 3 (Dim 3) we see a relation with terms like “aesthetic”, “mica”, “generic”, “pro”, “max” and the iPhone 11 Pro and iPhone 11 Pro Max, but no relation with the iPhone 11 or the words “transfer”, “red”, “yellow”, “purple”, “daughter”.
Comparing the two LSA approaches (DFM and TF-IDF), the words shared by Topic 2 are “scratch” and “iphone”. This makes sense, as we have seen that iPhone reviewers mainly compare their new model with previous models (hence the term “iphone”) or praise/complain about the condition of the smartphone received (hence “scratch”). For Topic 3, the terms changed significantly, meaning the context of the topic would be different, with the exception of the word “pro”.
Latent Dirichlet Allocation (LDA) is a topic modeling algorithm that is used to identify the topics present in a collection of documents. It is a generative model that assumes that each document is a mixture of a fixed number of topics, and that each word in the document is associated with one of the topics (Susan Li, 2018).
We decided to apply this algorithm to our dataset, building the LDA object with the textmodel_lda function from the seededlda package. We kept the same number of topics (5) as in the LSA DTM approach, because we tried 10, 9, and 7 topics and the results were not meaningful.
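A minimal sketch of this fit, assuming the same hypothetical dfm `dfm_models` of grouped reviews:

```r
library(quanteda)
library(seededlda)

# Unseeded LDA with 5 topics on the grouped reviews
set.seed(123)  # LDA estimation is stochastic
lda <- textmodel_lda(dfm_models, k = 5)

# Top 5 terms per topic
terms(lda, 5)
```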
Below we can observe the top 5 terms appearing in each topic:
| topic1 | topic2 | topic3 | topic4 | topic5 |
|---|---|---|---|---|
| mini | phone | phone | pro | samsung |
| red | battery | battery | arrive | 5g |
| size | screen | scratch | max | galaxy |
| 128gb | buy | iphone | aesthetic | s20 |
| se | camera | condition | detail | fingerprint |
The \(\phi\) (phi) matrix is the term-topic distribution. It represents the probability of a term occurring in a given topic: for a given topic, the \(\phi\) values indicate the likelihood of each term being associated with that topic, and the terms with the highest \(\phi\) values are those most strongly associated with it. To visualize these associations, we plot the \(\phi\) values of the 10 highest-probability terms within each topic.
So we can interpret the following:
The \(\theta\) (theta) matrix is the document-topic distribution. It represents the probability of a topic occurring in a given document: for a given document, the \(\theta\) values indicate the likelihood of each topic being present in that document.
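Both distributions are stored on the fitted seededlda object; a sketch, assuming the fitted model is named `lda`:

```r
# phi:   topics x terms,     P(term | topic)
# theta: documents x topics, P(topic | document)
phi   <- lda$phi
theta <- lda$theta

# The 10 highest-probability terms for, e.g., Topic 1
head(sort(phi[1, ], decreasing = TRUE), 10)

# The most likely topic for each document (model)
apply(theta, 1, which.max)
```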
Looking at the below chart, we can identify the following:
There are several ways to evaluate the quality of a topic model; in this case, we evaluate the LDA using metrics such as prevalence, coherence, and exclusivity.
Prevalence is a measure of how frequently a topic appears in the documents. A topic with a high prevalence is likely to be important and relevant to the overall collection of documents, while a topic with a low prevalence may not be as important or relevant.
Coherence is a measure of how well the words within a topic are related to each other. A topic with high coherence is likely to be more interpretable and easier to understand, while a topic with low coherence may be more difficult to interpret.
Exclusivity is a measure of how unique a topic is compared to the other topics in the model. A topic with high exclusivity is likely to be more distinct and easily separable from other topics, while a topic with low exclusivity may overlap with other topics and be harder to distinguish.
On the top left-hand side of the chart below, we can observe that the most prevalent topic among the models is Topic 2, which makes sense, as the previous chart showed that this topic was the most common. Topic 2 is also the most coherent, whereas Topic 1 is the least coherent. Topic 1 is, however, the most exclusive, because its five terms are the most specific to it.
Word embedding has been applied to our three review datasets: smartphone_reviews_final.csv, Apple_final.csv, and Samsung_final.csv. To embed words into 50-dimension vectors, we applied the word2vec::word2vec function using the cbow method with 30 iterations.
# # Read data
all_reviews <- read.csv(here::here("data/smartphone_reviews_final.csv"))
# # Train word2vec model
# Embed each word into a 50-dimension vector
all_model <- word2vec(tolower(all_reviews$Reviews), type = "cbow", dim = 50, iter = 30)
# Transform the result into a matrix
embedded_words <- as.matrix(all_model)
Here is an overview of the resulting matrix:
#> num [1:3762, 1:50] -1.58 -0.274 0.689 2.108 0.41 ...
#> - attr(*, "dimnames")=List of 2
#> ..$ word : chr [1:3762] "doubtful" "great" "them" "serve" ...
#> ..$ dimension: chr [1:50] "V1" "V2" "V3" "V4" ...
To visualize the results in an interactive fashion, we used uwot::umap and the plot_ly function from the plotly package. In the following 2D and 3D graphs, only a subset of the words has been plotted in order to facilitate the visualization.
# # Read models
all_model <- read.word2vec(here::here("script/models/word2vec/all_model.bin"))
# # Create interactive plot of words
# source: https://cran.r-project.org/web/packages/word2vec/readme/README.html
embedded_words <- as.matrix(all_model)
viz <- umap(embedded_words, n_neighbors = 25, n_threads = 5, n_components = 3)
# Create the dataframe used for the plot
df <- data.frame(word = gsub("//.+", "", rownames(embedded_words)),
xpos = gsub(".+//", "", rownames(embedded_words)),
x = viz[, 1], y = viz[, 2], z = viz[, 3],
stringsAsFactors = FALSE)
# Subset the dataframe
set.seed(456)
nb_words_to_display <- 1500
df <- df[sample(1:nrow(df), nb_words_to_display),]
# Interactive 2D plot
graph_2d <- plot_ly(df, x = ~x, y = ~y, type = "scatter", mode = "text", text = ~word) %>%
layout(title = "2-Dimension word embedding (interactive graph)")
# Interactive 3D plot
graph_3d <- plot_ly(df, x = ~x, y = ~y, z = ~z, type = "scatter3d", mode = 'text', text = ~word) %>%
layout(title = "3-Dimension word embedding (interactive graph)")
We pre-processed the documents (reviews) by removing punctuation, lowercasing and splitting them into individual words. We then used the results from the word embedding to embed each review using the following function:
# This function takes 2 arguments: A word vector (= sentence/document) and a matrix of embedded words
# It uses each embedded word values to return the value of the word vector
# Here the input word vector should be the words in a given document
# It assumes that by averaging the vector values of the words found in the
# document it's possible to summarize the information contained in a document
# as a vector
document_embedding <- function(words, embedded_words_matrix){
# Keep words present in the embedded_words_matrix
look_up_words <- words[words %in% rownames(embedded_words_matrix)]
# Document embedding
# If there is more than one word in the document, average with colMeans
if (length(look_up_words) > 1){
document_embedding <- colMeans(embedded_words_matrix[look_up_words,])
# If length = 1, don't run colMeans as it would throw an error
} else if (length(look_up_words) == 1){
document_embedding <- embedded_words_matrix[look_up_words,]
# If look_up_words is empty return a vector of zeros
} else {
document_embedding <- rep(0, ncol(embedded_words_matrix))
}
# Return embedded document
return(document_embedding)
}
# # Embed documents (reviews) using document_embedding function
# Remove punctuation
all_reviews_sentences <- gsub('[[:punct:] ]+',' ', all_reviews$Reviews)
# Split sentences into words (to lower and trimmed) - Resulting in a list
document_words <- lapply(strsplit(tolower(all_reviews_sentences), " "), trimws)
# Document embedding using own document_embedding function
embedded_documents <- lapply(X = document_words, FUN = document_embedding, embedded_words_matrix = embedded_words)
Here is an overview of the resulting matrix:
#> num [1:15297, 1:50] -0.293 -0.629 -0.319 -0.627 -0.278 ...
#> - attr(*, "dimnames")=List of 2
#> ..$ review : chr [1:15297] "1" "2" "3" "4" ...
#> ..$ dimension: chr [1:50] "V1" "V2" "V3" "V4" ...
In this section, we are using our previously created embedded document matrix in order to cluster documents together. In theory, the clustering should generate groups of similar documents. This is useful to understand the main topics and themes discussed by customers in their reviews.
For the clustering approach, we decided to create a total of 7 groups with nstart = 20. The graph below shows the results for a subset of 500 reviews.
# # Read all reviews
all_reviews <- read.csv(here::here("data/smartphone_reviews_final.csv"))
# # Read embedded document matrix
embedded_documents <- select(read.csv(here::here("script/unsupervised_learning/embedding/embedding_data/embedded_documents.csv")), -X)
# # Clustering of documents
set.seed(650)
number_of_centers <- 7
all_reviews_clustering <- kmeans(embedded_documents, centers = number_of_centers, nstart = 20)
# Store size of clusters
cluster_size <- tibble(cluster = seq(1:number_of_centers),
number_of_reviews = all_reviews_clustering$size)
In order to capture the topics discussed in each cluster, we computed the most frequent words per cluster:
# Create dataset for plotting: Get top words per cluster
cluster_topics <- all_reviews %>%
mutate(review_id = seq(1:nrow(all_reviews))) %>%
relocate(review_id, .before = "Brand") %>%
unnest_tokens(output = "word",
input = "Reviews",
to_lower = TRUE,
strip_punct = TRUE,
strip_numeric = TRUE) %>%
filter(word %notin% stop_words$word) %>%
group_by(cluster, word) %>%
tally() %>%
mutate(freq = n/sum(n)) %>%
filter(n >= 3) %>%
arrange(cluster, desc(freq)) %>%
top_n(15, freq) %>%
left_join(cluster_size, by = c("cluster" = "cluster")) %>%
mutate(cluster_header = paste0("Cluster ", cluster, "\nn_reviews = ", number_of_reviews, ", n_words = ", round(n/freq), "\nav_length = ", round(round(n/freq)/number_of_reviews, 2)))
Thanks to the results of the clustering approach, we are able to observe several patterns in the reviews.
Several clusters are dominated by positive terms such as excellent, perfect and perfectly. As we can see, the clustering managed to surface groups of positive comments. However, it is quite surprising that the only cluster of negative comments contains just 203 reviews and mostly concerns the quality of the media playback and the speakers. It would be worth analyzing the large clusters that do not surface any particular sign of satisfaction (1, 4, 5) to better understand their content. Possible approaches include increasing the number of clusters or ‘clustering the clusters’ of interest.
This part of the report attempts to use the results of the sentiment analysis and document embedding to train a random forest algorithm to predict reviews’ sentiment. The following dataset (15’297 x 53) has been used to train the algorithm (only the first and last dimension of the document embedded matrix are displayed):
# # Read sentiment score per review
sentimentr_documents <- read.csv(here::here("script/sentiment_analysis/sentiment_data/sentimentr_per_review.csv")) %>%
select(review_id, brand = Brand, model = Model, sentimentr_score = sentimentr_value)
# # Read document embedded matrix
embedded_documents <- select(read.csv(here::here("script/unsupervised_learning/embedding/embedding_data/embedded_documents.csv")), -X) %>%
mutate(review_id = as.integer(rownames(.))) %>%
relocate(review_id, .before = V1)
# Join sentiment and embedding matrix to form the dataset
dataset <- left_join(embedded_documents, sentimentr_documents, by = "review_id") %>%
select(-review_id)
dataset$brand <- as.factor(dataset$brand)
dataset$model <- as.factor(dataset$model)
#> 'data.frame': 15297 obs. of 5 variables:
#> $ V1 : num -0.293 -0.629 -0.319 -0.627 -0.278 ...
#> $ V50 : num 0.328 0.188 0.81 0.395 0.113 ...
#> $ brand : Factor w/ 2 levels "Apple","Samsung": 1 1 1 1 ..
#> $ model : Factor w/ 20 levels "iPhone 11","iPhone 11 Pr"..
#> $ sentimentr_score: num 0.2157 0.0709 -0.075 0.5162 0.26 ...
The following code has been used to train the random forest (100, 500 and 1000 trees):
# # Training and testing dataset
# Split the data
set.seed(456)
split <- sample.split(dataset$sentimentr_score, SplitRatio = 0.75)
train <- subset(dataset, split == TRUE)
test <- subset(dataset, split == FALSE)
# # Train the model: Random forest regressor
nb_trees <- 100 # 500, 1000
fit <- randomForest(data = train,
sentimentr_score ~ .,
ntree = nb_trees,
mtry = 25,
importance=TRUE)
# Save the model
save(fit, file = paste0(here::here("script/models/randomForest/rf_", nb_trees, ".RData")))
The accuracy of the model has been measured on the test set using the caret::RMSE function, and the results have been plotted in order to visualize prediction accuracy. The plot below contains a subset of 2500 reviews out of the 4040 in the test set:
# # Evaluate model accuracy on test set
predictions <- predict(fit, test[1:(ncol(test) - 1)])
accuracy <- RMSE(predictions, test[["sentimentr_score"]])
# Data frame including actual values and predictions
results <- tibble(actual = test[["sentimentr_score"]],
predictions = predictions) %>%
arrange(actual)
results["paired"] <- 1:nrow(results)
results <- melt(results, id.vars = "paired")
# Data to be plotted
set.seed(500)
plot_data <- filter(results, paired %in% sample(1:nrow(results), 5000))
plot_actual <- filter(plot_data, variable == "actual")
This graph shows that the random forest model performs rather poorly overall. It is particularly bad at predicting extreme sentiment values, and it appears to predict values between -0.1 and 0.5 almost at random, regardless of the input. One of the main explanations is that the training of the model is based on approximate labelling of the data: the sentiment score is given by the sentimentr package, which is itself limited in accuracy. Also, the embedding of documents rests on the assumption that a review can be summarised as a 50-dimension vector obtained by averaging the 50-dimension vectors of the words it contains. To improve accuracy, it might be worth increasing the number of dimensions of those vectors as well as the size of the training set. Furthermore, the sentiment score attribution itself could be improved; a natural language processing deep learning algorithm might capture the sentiment of the reviews better.
In conclusion, the data preparation for this study involved collecting smartphone reviews of Apple and Samsung phones from the US Amazon marketplace, then cleaning and preparing the data: detecting and translating non-English reviews; removing non-conforming formats, symbols, numbers, separators, and “stop words”; and performing lemmatization to reduce repetition. We found that the most common words across all reviews were “phone,” “battery,” and “screen,” and that the main topics of concern for consumers were battery life and screen quality.
Deep diving into the reviews through the sentiment analysis revealed that, according to the nrc dictionary, Apple reviews contained more trust and joy than Samsung reviews, and that Apple phones also triggered relatively less sadness, fear, and disgust. We also noted that the iPhone 11 reviews differ somewhat from those of the other phones.
Further, the unsupervised learning analysis showed that the iPhone 11 and Samsung Galaxy S21 FE reviews were the most dissimilar, and that the word “samsung” was the most dissimilar to the other words. One might have expected the distinctiveness of the iPhone 11 reviews to be explained by their large number compared to other phones, which would make them more diverse and cover more aspects; however, as we saw in the lexical diversity analysis, they are in fact the least diverse of all the reviews. The same was noticed with the topic modeling technique Latent Semantic Analysis (LSA), since both the iPhone 11 and the Samsung Galaxy S21 FE were clearly orthogonal to most of the other models. The topic modeling technique Latent Dirichlet Allocation (LDA) identified common patterns and topics within the documents, such as battery life, screen quality, camera performance, and the sentiment of consumers towards the phones and sellers.
As for the word and document embedding, the results showed that the clustering was able to group similar reviews together and surface patterns in the data. The cluster sizes were not uniformly distributed, and the average length of reviews varied among the clusters. The clusters reflected different content, such as battery and battery life, phone condition, happy customers, and bad comments.
Finally, the supervised learning analysis using random forest and gradient boosting performed poorly. We found that the most important features in predicting the rating of a review were the phone characteristics, screen size, and storage capacity.
The overall analysis of smartphone reviews on Amazon US provided insights about consumers’ reactions and sentiments about iPhone and Samsung models and their sellers. Based on the conclusion provided, some recommendations for future studies could include:
- Expanding the dataset to include reviews from other marketplaces or countries, to see whether the findings are consistent or whether there are regional differences in consumer preferences.
- Using additional techniques for sentiment analysis, such as applying other dictionaries or incorporating machine learning algorithms, to see whether the results change or become more accurate.
- Investigating the influence of other factors, such as seller or brand reputation, on consumer opinions and ratings.
- Applying more advanced machine learning techniques, such as deep learning, to see whether they improve the accuracy of the rating predictions.
- Examining the impact of specific phone features, such as camera quality or durability, on consumer ratings and preferences.
- Analyzing the reviews over time to see how consumer opinions change as new models are released and technology advances.